Since mid-1971 the Instructional Objectives Exchange
has been engaged in a major effort to develop and disseminate
criterion-referenced tests in the fields of reading, mathematics,
language arts, and social studies. This paper isolates the chief
technical decision-alternatives faced in this project, such as: (1)
the optimal number of tests to produce, (2) choosing a well defined
behavior domain, (3) devising a homogeneous domain, and (4) the high
cost of preparing criterion-referenced tests. The author describes
the rationale for each decision, then appraises the adequacy of the
decisions on the basis of empirical results. (Author/MLP)

TECHNICAL TRAVAILS OF DEVELOPING CRITERION-REFERENCED TESTS



W. James Popham
University of California, Los Angeles 

Despite the fact that numerous educational practitioners are
becoming disenchanted with norm-referenced achievement measures for
evaluation purposes, until recently there have been few reports of large
scale efforts to construct criterion-referenced tests which might serve
as alternatives. Perhaps the paucity of major projects designed to develop
criterion-referenced tests provides a partial explanation for our extremely
limited progress in devising the technical procedures needed to construct
criterion-referenced tests. Maybe we have not yet acquired a sufficient
experiential base to permit the invention of respectable counterparts to
our well honed procedures for establishing the validity, reliability,
and item adequacy of norm-referenced tests.

On the other hand, perhaps the absence of many major efforts to
devise criterion-referenced tests is partially attributable to a
recognition of today's primitive state of criterion-referenced measurement
methodology, and a concomitant trepidation about plunging into a major
engineering effort where the tool shed is so barren.

But whether cause or effect, measurement personnel obviously can
learn much from those few large scale projects which have produced
criterion-referenced tests. Recent announcements of the availability of
criterion-referenced measures (in reading and mathematics) by such firms
as CTB/McGraw Hill and Educational Development Corporation attest to the
likelihood that a number of test developers will soon have experiences
to share and, hopefully, technical advances to trade.

Certainly, a number of state and local education agencies have set
out to enter the criterion-referenced measurement game in a major way.
To mention but a few, formidable test development activity is underway
in such states as New Jersey, Michigan, Oregon, Wisconsin, and California
plus a number of large metropolitan school districts throughout the
nation. But we need to share our experiences, both successes and failures,
in attempting to cope with the technical problems presented when one
undertakes the development of a large number of criterion-referenced
measures. The paper, consistent with an experience-sharing mission,
describes practical dilemmas and efforts to provide technical solutions
during an ongoing test development project conducted at the Instructional
Objectives Exchange (IOX), a Los Angeles-based non-profit educational
corporation. The paper could aptly be subtitled: Confessions of a
Criterion-Referenced Test Developer, since it will recount, as candidly
as possible, the dubious decisions, outright errors, and glittering
insights (both of them) made during the IOX criterion-referenced test
development enterprise.

Perhaps better known as a depository of measurable instructional
objectives, since to serve as such depository was the reason for its
establishment in 1968, the Instructional Objectives Exchange embarked on
a major test development effort in the summer of 1971. In an attempt to
provide test instruments which would more readily incline educators to
organize their instruction and evaluation activities around measurable
learner outcomes, the IOX staff initiated a project to develop
criterion-referenced measures in all major subject fields taught in
public schools. During the first two years of the project's operation,
criterion-referenced tests in reading and mathematics were produced.
During the third year, tests in language arts and U.S. Government were
developed. Currently, more advanced tests in mathematics and language
arts are being prepared. At this point we have almost three years' worth
of experience¹ in producing these tests, and it is time to engage in a
little stock-taking, soul-searching, and agonizing reappraisals. This
paper will focus on the major problem areas which have thus far been
isolated.



An Optimal Number of Tests

As will soon become apparent, one of the more pervasive problems
facing those who would work with criterion-referenced tests stems from
our almost complete naivete regarding what level of content or skill
generality tests should incorporate in order to function optimally for
purposes of educational evaluation. Although curriculum and measurement
specialists have reminded us for decades that this represents a critical,
unresolved problem, we are not substantially closer to a solution today
than we were at the start of this century.

The first place the generality-level difficulty manifests itself is
in connection with determining how many criterion-referenced tests to
produce. Once the IOX staff had decided to develop a set of
criterion-referenced measures which would be of value to educators
throughout the nation, the first problem we had to deal with was, "How
many tests?"

Educators learned an important lesson from their 1960's preoccupation
with behavioral objectives, namely, that increasing the specificity with
which an objective is stated does not necessarily increase the educational
utility of that objective. The difficulty that arises when we try to
make our objectives too specific is that we end up with literally thousands
of such super-specific objectives. As an objective becomes more specific,
its scope is generally reduced, thereby obliging the educator using the
objective to keep track of more objectives than is feasible.

The situation with respect to criterion-referenced test construction 
is analogous. We could produce several hundred tests of the various 



¹After the first year of the project's existence, a technical paper was
produced delineating the various technical procedures employed during
the initial year of the activity: Popham, W. James, Procedural Guidelines,
Developing IOX Objectives-Based Tests, Instructional Objectives Exchange,
Los Angeles, California, August, 1972.

skills required to assess an individual's mastery of reading competencies.
But, by trying to test all of the isolatable skills required in reading,
we would have created a battery of tests so awesome in bulk, let alone
conceptual complexity, that few sane teachers would use the tests. And
odds are that if teachers did use such tests for any extended period,
all remnants of sanity would vanish.

Thus, we knew we had to go with fewer tests, but how many fewer was
up for grabs. We tried to approach the problem by thinking about the
number of outcomes (as reflected by student test performances) that a
typical teacher could realistically monitor. We concluded, on the basis
of shared personal experiences, that somewhere between a half-dozen and
a dozen² outcomes per course per year could be meaningfully dealt with
by most teachers. This meant, for example, if IOX was developing a set
of reading tests to be used in grades K, 1, 2, and 3, we would try to
develop somewhere between 24 (six tests times four grade levels) and 48
(twelve tests times four grade levels) instruments.
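
The arithmetic behind that range can be sketched in a few lines of
Python. The function name and its default bounds are illustrative
assumptions only; the six-to-twelve band is the staff's informal
estimate, not an empirical finding.

```python
def instrument_count_range(grade_levels, low_per_level=6, high_per_level=12):
    """Return (fewest, most) instruments for a set of grade levels.

    The per-level bounds are the half-dozen and dozen outcomes the text
    assumes a teacher can monitor per course per year (an assumption).
    """
    n_levels = len(grade_levels)
    return n_levels * low_per_level, n_levels * high_per_level


# Reading tests for grades K through 3, as in the example above.
print(instrument_count_range(["K", "1", "2", "3"]))  # (24, 48)
```

Changing either bound by a few outcomes per level shifts the total by a
multiple of the number of grade levels, which is one reason the choice
of bounds matters.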

But this mechanism for reaching a decision regarding numbers of
tests (personal estimates of teachers' capacity to process test information)
certainly induces little confidence. It represents a primitive form of
pooled guessing to answer a question which is amenable to an empirical
solution. There is no reason to estimate how many tests teachers can
meaningfully monitor when we can carry out studies which will demonstrate
what teachers' preferences truly are. A variety of straightforward
investigations could be concocted, for example, in which illustrative
test results of varying degrees of complexity, content coverage, etc.
were presented to teachers for their reactions. Teachers could register
preferences for different types of test data, different numbers of
tests, etc. It would also be possible, though more demanding, to follow
up teachers who had been presented with differential test information to
see which form of the information led to more meaningful teacher
utilization of data.

The point being made, of course, is that as long as we are forced
to rely on intuitive estimates of how many tests teachers can keep track
of, our estimates are apt to be some distance from reality. We need to
embark on a programmatic research effort to understand more fully the
level-of-generality problem as it applies to determining the optimal
number of tests.

In the meanwhile, we can eschew the extremes regarding test numbers,
i.e., two or three tests are too few; two or three hundred tests are too
many. The limits of that tolerance band are too wide, unfortunately, to
be of much help to the criterion-referenced test constructor. Those
individuals and agencies who do not deal with this problem prior to
their decision to generate criterion-referenced measures are failing to
confront an issue that may well render their development effort ineffectual.



²Several IOX staff members possessed previous experience in bakery and
donut shops, leading to a proclivity to think in terms of dozen or
half-dozen.



Choosing a Domain

When the IOX staff began to create criterion-referenced tests we
had to step back several paces and remind ourselves why it was that
norm-referenced measures were not satisfactory for evaluative purposes.
Although there are a number of reasons why this is so, the chief deficit
of norm-referenced measures is that they fail to provide a satisfactory
description of what examinees can or cannot do. Since norm-referenced
measures yield scores interpretable according to the examinee's relative
standing with respect to a norm group, it is often difficult to obtain a
clear idea of what the dimension is on which examinees are differing.
Criterion-referenced tests, on the other hand, are used to ascertain an
individual's status with respect to a well defined behavior domain, that
is, class or set of behaviors. It is this well defined behavior domain,
in fact, which constitutes the "criterion" to which examinee performance
is referenced. To the extent that a criterion-referenced test fails to
provide an explicit description of what it is that the examinee can or
cannot do, it offers few advantages, at least for purposes of evaluation,
over a norm-referenced measure.

In view of the need to provide measures which yielded better
descriptions of learner performance, the IOX staff drew heavily on the
work of Wells Hively and his associates³ who had been working since the
mid-sixties to devise ways of delimiting classes of learner behaviors for
purposes of curriculum development and test design. The item form
approach used by Hively provided the IOX project with its chief model,
although, as Hively and his cohorts used them, item forms were typically
too complicated for sustained use, either by our item writers or by the
public school educators who would be relying on them for interpretation
purposes.

We saw two criteria as important in devising our domain descriptions,
namely, clarity and brevity. These two criteria are typically in
conflict. We tried to produce domain descriptions which were detailed
enough to delimit the class of learner behaviors to be measured, but
short enough to be used for interpretive purposes by busy educators.
The task was a difficult one.

Because of the IOX association with instructional objectives, we
referred to our domain descriptions as "amplified objectives," since we
essentially elaborated on a simple statement of instructional objective
in order to produce a domain description.

Thus, we settled the question of what our domain description would
more or less look like, that is, a few paragraphs which attempted to
describe the nature of (1) the stimuli presented to the examinee, (2) the
response options available, and (3) the criteria for judging the examinee
response. But as we set out to delineate the domains for our tests, we



³Hively, Wells; Maxwell, Graham; Rabehl, George; Sension, Donald, and
Lundin, Stephen. Domain-Referenced Curriculum Evaluation: A Technical
Handbook and a Case Study from the MINNEMAST Project, Center for the
Study of Evaluation, UCLA, 1973.

ran into an unexpected problem. For any general skill that we tried to
operationalize via a more explicit statement of a learner behavior
domain, we found that we had several clearly competitive domains. The
trick was to decide on the best representative from the contending
domains.

Let's illustrate this problem a bit more tangibly. Suppose we
wanted to assess whether a student could satisfactorily solve word
problems involving elementary multiplication operations. Now one very
clear difference in the kinds of domains we might choose would hinge on
whether our story problems called for a constructed learner response, as
when the child is called on to write out answers to problems, or whether
the problems required selected responses, as when the student chooses an
answer from several alternatives. Other variations in the domain might
involve the nature of the key ingredients in the story problems, such as
the exact kinds of eligible problem types that could be presented to the
learner, that is, what would be missing, what kinds of distractor data
would be included in the story problem stem, and so on.
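
One way to see how quickly the contenders multiply is to write two of
them down side by side. The sketch below is only an illustration: the
DomainDescription class, its field names, and the wording of each entry
are hypothetical, built around the three components a domain description
covers (stimuli, response options, and criteria for judging the
response).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainDescription:
    """Hypothetical container for the three parts of a domain description."""
    skill: str
    stimuli: str            # what is presented to the examinee
    response_options: str   # how the examinee may respond
    scoring_criteria: str   # how a response is judged


SKILL = "solve word problems involving elementary multiplication"

constructed = DomainDescription(
    skill=SKILL,
    stimuli="story problem; the product is the missing quantity",
    response_options="constructed: learner writes out the answer",
    scoring_criteria="written answer equals the correct product",
)

selected = DomainDescription(
    skill=SKILL,
    stimuli="story problem; the product is the missing quantity",
    response_options="selected: learner chooses among several alternatives",
    scoring_criteria="chosen alternative is the correct product",
)

# Same general skill, identical stimuli, yet two distinct candidate domains.
competing = [d for d in (constructed, selected) if d.skill == SKILL]
print(len(competing), constructed == selected)  # 2 False
```

Varying the stimuli field as well (what is missing, what distractor data
appear in the stem) would add further candidates for the same skill.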

Anyone who believes that, when setting out to build criterion- 
referenced measures, it will be obvious which domain of behaviors should 
constitute the criterion (to which test items will be referenced) is in 
for a disappointment. The alternatives will be numerous. The selection
decisions will be difficult. 

At IOX we tried to approach the task somewhat rationally by setting
forth criteria to guide our developers. The following six considerations 
were to be used in deciding on domains: 

1. General Acceptance. Is the behavior domain considered
   important by teachers, subject matter specialists, the
   public?

2. Transferability Within the Domain. Since the learner
   behavior measured by the test is highly specific, will
   that behavior, when mastered by the learner, be likely
   to transfer to similar skills (other domains) within
   that general class of behavior desired?

3. Transferability Outside the Domain. Will the domain,
   once mastered, be likely to transfer to learner behaviors
   required in rather different types of behavior domains?

4. Terminality. If the organization of learner behavior is
   hierarchical, will the domain selected tend to be terminal
   rather than en route?

5. Amenability to Instruction. Will the domain measure a
   learner skill that can be taught, rather than a native trait
   relatively immune to instruction?

6. Ease of Scorability. Other factors being equal, will
   the domain selected yield learner responses which can
   be easily scored, not necessarily objectively scored,
   by those educators using the tests?

But even though our staff has attempted to use those criteria in
deciding which domains to select (create), we have not subjected our
decisions to any kind of empirical verification. The most important
question we should probably ask is, "How generalizable will the skill be
that is reflected by the learner's mastery of this particular domain?"
It would appear that this question is amenable to an empirical answer
and that the methodology for treating this issue will resemble but not
coincide with those tactics ordinarily employed when one attempts to
secure construct validity data. This question is treated at greater
length elsewhere,⁴ but it is a critical concern for those devising
criterion-referenced measures. Technical procedures for dealing with
this problem are desperately needed.



Domain Homogeneity

A third technical problem area we have encountered in our IOX test
development operations arises from the same difficulty we encountered
when deciding on how many tests to prepare, that is, the level of
generality question. While it might be possible to lump both problems
under the handy rubric of generality-level, the problem takes a different
shape when faced by those who must generate domain definitions. Let's
try to clarify the issues.

In the abstract, it would be desirable to devise criterion-referenced
test domains so that they permit the generation of a pool of
items which would not only appear to be homogeneous but would also, on
the basis of empirical data, function in a homogeneous manner. Yet, in
order to create domain definitions which possess these qualities, we
found that we were isolating only a very discrete type of learner behavior.
For example, suppose we are dealing with a domain which taps the learner's
ability to divide words into their constituent syllables. Now when one
analyzes the various kinds of syllabication tasks which might be included
in a domain description, even assuming we had decided how the learner
would identify syllables, there are several discernibly different sorts
of tasks which might be included. Most of these would be contingent on
the kinds of words with which the learner will be presented. There are,
for instance, simple two-syllable words in which the syllable-division
task is straightforward, e.g., Oxford. Then there are two-syllable
words where the novice could not quickly discern whether a middle consonant
belongs with the first or second syllable, e.g., color. There are a
number of other classes of words with varying numbers of syllables and
varying rules for syllabication. Now the technical decision facing the
domain writer is whether to include these different sorts of syllabication
tasks in a single domain, more or less randomly, or to deliberately try
to include a preset number of such tasks.

Some would recommend that we adopt a diagnostic strategy by including
a number of the different subskills in a single domain in order to
detect which particular subskills the learner can perform. But when you
do this, of course, you're obligated to include a reasonable number of
items measuring the learner's prowess with respect to each subskill.
This leads to lengthy tests, too lengthy sometimes to be practical.



⁴Popham, W. James, Systematic Educational Evaluation, manuscript in
preparation.

Others recommend that we adopt a mixed strategy, tossing in all
kinds of different item types with the anticipation that some sort of
global estimate of learner skill will emerge. The problem with this
approach, of course, is that with only a modest number of items per
domain, a few items representing a particular subskill can inequitably
influence the learner's performance on the domain, and we will be unaware
of what subskills are influencing a score on the total domain.
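
A small worked case may make the difficulty concrete. The sketch below
assumes a ten-item domain (the paper reports that IOX tests carry either
five or ten items per domain); the subskill labels, the item allocations,
and the all-or-nothing mastery rule are invented purely for illustration.

```python
def domain_score(items_per_subskill, mastered):
    """Percent correct, assuming the learner answers every item from a
    mastered subskill correctly and misses every item from any other."""
    total = sum(items_per_subskill.values())
    correct = sum(n for skill, n in items_per_subskill.items()
                  if skill in mastered)
    return 100 * correct / total


# Two hypothetical ten-item forms drawn from the same mixed domain.
form_a = {"easy two-syllable": 7, "ambiguous consonant": 1, "three-syllable": 2}
form_b = {"easy two-syllable": 4, "ambiguous consonant": 4, "three-syllable": 2}

# One learner, weak only on words with an ambiguous middle consonant.
learner_1 = {"easy two-syllable", "three-syllable"}
print(domain_score(form_a, learner_1))  # 90.0
print(domain_score(form_b, learner_1))  # 60.0

# A second learner, weak only on the easy two-syllable words.
learner_2 = {"ambiguous consonant", "three-syllable"}
print(domain_score(form_b, learner_2))  # 60.0
```

The first learner's score swings from 90 to 60 purely because of how
many items happened to represent one subskill, and on the second form
two learners with different deficits receive the same total score.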

A third possibility is to opt for a terminal skill strategy where
we isolate only the learner skill which is the most terminal, then
measure that skill. Typically, of course, this leads to measuring the
most difficult of those skills which might be included in the domain.
While this approach characteristically yields a homogeneous domain, it
is obviously of little value for any type of diagnostic purpose.

I would like to be able to report that the IOX staff, having
considered these alternatives rationally, has completed a series of
empirical investigations which demonstrate conclusively what the proper
approach should be. I would also like to be ten years younger, twenty
years wiser, and forty times wealthier. The distressing truth is, however,
that in the IOX project we have adopted a stance of unflinching
vacillation. We really don't know how to proceed with respect to this
problem.



Costs and Conscience

In 1971 IOX began to produce criterion-referenced tests because,
frankly, we were tired of waiting around for major test publishers to do
it. We were well aware of the deficiencies of norm-referenced measures
for purposes of educational evaluation, but could find few commercially
available alternatives to such tests. We were unwilling to continue to
tell educators that, while norm-referenced tests were to be avoided,
criterion-referenced counterparts would be available only when the
measurement princes in Princeton, New Jersey (and elsewhere) got off
their duffs.

So, we went into the criterion-referenced test development business.
What an expensive business it is! Although any kind of test preparation
is costly, we had no idea how much personnel time would be tied up in
devising domain descriptions, generating hopefully congruent items and
monitoring item-domain congruency. Maybe the measurement princes weren't
so dumb.

The choice facing us quickly became apparent. Either we could
proceed with our developmental activities, having recognized that our
self-support financial base was far too modest to permit the kind of
quality we wanted, or we could fold up our development tents with the
hope that other, better funded organizations would get around to producing
decent criterion-referenced measures. It was a real choice point for
us. It will be a real choice point for most agencies that embark on a
criterion-referenced test development effort.

As might be inferred from the fact that our test development efforts
are still underway, the IOX staff decided to go ahead with their test
creation activities. In retrospect, even though we suffer periodic
conscience pangs because our tests and domain descriptions aren't as
flawless as we wish, I am pleased we decided to continue. A number of
the commercially distributed criterion-referenced tests now making their
way to the educational market suffer from inadequate or non-existent
domain descriptions as well as an inadequate number of items per domain
(IOX tests have either five or ten items for each domain). The IOX
tests at least provide an alternative.

The more fundamental question emerging from this problem, however,
is whether a commercial or nonprofit test development agency, sans
substantial external subsidies, can engage in the development of truly
high quality criterion-referenced tests. I believe, although there will
be occasional exceptions to the rule, that the answer is no.



A Technical Wasteland

This leads to a related, and final, observation. There is a growing
force within the educational community to turn from norm-referenced
measures with their technical deficits (for evaluation purposes) and to
replace them with more sensitive criterion-referenced assessment devices.
That's just fine.

At least it would be just fine if we had any assurance that the 
criterion-referenced measures which will be produced in the next few 
years will be capable of doing the jobs educators want accomplished. 
Unfortunately, because the technological support base for criterion- 
referenced measurement development is so perilously weak at present, our 
predictions regarding the quality of future criterion-referenced tests 
must be gloomy. 

What we need, and now, is a well financed, governmentally-initiated
project to expand our weak technological base in this crucial measurement
area. There are, at this writing, no major efforts underway to sharpen
the technical tools needed to produce better criterion-referenced measures.
Because of the pivotal role to be played by such measures in all sorts
of evaluation and accountability programs, this situation is intolerable.

We must, without delay, muster whatever clout we have in order to
encourage the National Institute of Education or some comparable agency
to foster the kind of technical-support activities which will lead to a
reduction of the technical travails associated with developing
criterion-referenced tests. The decisions which these tests will influence
are too important to treat the tests with clumsy crowbars. We need scalpels.